fix: Add schema validation for native_datafusion Parquet scan #3759
vaibhawvipul wants to merge 7 commits into apache:main
When spark.comet.scan.impl=native_datafusion, DataFusion's Parquet reader silently coerces incompatible types instead of erroring like Spark does.
Thanks for working on this @vaibhawvipul. This looks like a good start. Note that the behavior varies between Spark versions; Spark 4, for example, is much more permissive. Could you add end-to-end integration tests, ideally using the new SQL-file-based testing approach or Scala tests that compare Comet and Spark behavior?
@vaibhawvipul you need to run "make format" to fix lint issues
Thank you. Fixed.
I'm unsure how we should proceed, given that Spark 4.0 supports widening type coercion. Would it be better to just document that Comet allows coercion in such cases on Spark 3.x? 🤔
That is simpler for sure. We can document that Comet is more permissive than Spark 3.x. However, this PR keeps validation for clearly invalid cases regardless of the Spark version. The validation isn't trying to be stricter than Spark 3.x; it prevents DataFusion from silently producing wrong results for genuinely incompatible types.
Which issue does this PR close?
Closes #3720.
Rationale for this change
DataFusion is more permissive than Spark when reading Parquet files with mismatched schemas. For example, reading an INT32 column as bigint, or TimestampLTZ as TimestampNTZ, silently succeeds in DataFusion but should throw SchemaColumnConvertNotSupportedException per Spark's behavior. This breaks correctness guarantees that Spark users rely on.
What changes are included in this PR?
Adds schema compatibility validation in schema_adapter.rs:
- validate_spark_schema_compatibility() checks each logical field against its physical counterpart when a file is opened
- is_spark_compatible_read() defines the allowlist of valid Parquet-to-Spark type conversions (matching TypeUtil's logic)
- Error messages use the "Column: [name], Expected: <type>, Found: <type>" format
How are these changes tested?
- parquet_int_as_long_should_fail - SPARK-35640: INT32 read as bigint is rejected
- parquet_timestamp_ltz_as_ntz_should_fail - SPARK-36182: TimestampLTZ read as TimestampNTZ is rejected
- parquet_roundtrip_unsigned_int - UInt32→Int32 (existing test, still passes)
- test_is_spark_compatible_read - unit test covering compatible cases (Binary→Utf8, UInt32→Int64, NTZ→LTZ, Timestamp→Int64) and incompatible cases (Utf8→Timestamp, Int32→Int64, LTZ→NTZ, Utf8→Int32, Float→Double, Decimal precision/scale mismatches)
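For readers who have not opened the diff, the allowlist idea can be sketched roughly as follows. This is not the code in schema_adapter.rs: the `Ty` enum and `Field` struct are hypothetical stand-ins for the Arrow types the real functions operate on, only the conversions named in the unit test description are encoded, and matching columns by exact name and skipping columns absent from the file are assumptions made for illustration.

```rust
// Hypothetical, simplified stand-in for Arrow's DataType (assumption).
#[derive(Debug, Clone, PartialEq)]
pub enum Ty {
    Binary,
    Utf8,
    Int32,
    Int64,
    UInt32,
    Float32,
    Float64,
    TimestampLtz,
    TimestampNtz,
    Decimal { precision: u8, scale: i8 },
}

// Hypothetical field description: a column name plus its type.
pub struct Field {
    pub name: String,
    pub ty: Ty,
}

/// Allowlist: returns true only when a column stored as `file`
/// may be read as the requested `spark` type.
pub fn is_spark_compatible_read(file: &Ty, spark: &Ty) -> bool {
    use Ty::*;
    match (file, spark) {
        // Identical types always pass; decimals must match exactly.
        (a, b) if a == b => true,
        (Binary, Utf8) => true,
        (UInt32, Int64) => true,
        (TimestampNtz, TimestampLtz) => true,
        (TimestampLtz | TimestampNtz, Int64) => true,
        // Everything else is rejected, including Int32 -> Int64,
        // Float32 -> Float64, and LTZ -> NTZ.
        _ => false,
    }
}

/// Checks each logical (requested) field against the physical (file)
/// field of the same name and reports the first mismatch.
pub fn validate_spark_schema_compatibility(
    logical: &[Field],
    physical: &[Field],
) -> Result<(), String> {
    for want in logical {
        // Assumption: columns missing from the file are not an error here.
        if let Some(found) = physical.iter().find(|f| f.name == want.name) {
            if !is_spark_compatible_read(&found.ty, &want.ty) {
                return Err(format!(
                    "Column: [{}], Expected: {:?}, Found: {:?}",
                    want.name, want.ty, found.ty
                ));
            }
        }
    }
    Ok(())
}
```

An allowlist rejects any pairing that is not explicitly listed, so a conversion DataFusion starts supporting later would still fail validation until someone adds it deliberately.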